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Abstract 

Many multivariate data analysis techniques for an to x n matrix Y are related to the model 
Y = M + E, where Y is an to x n matrix of full rank and M is an unobserved mean matrix 
of rank K < [m An). Typically the rank of M is estimated in a heuristic way and then the 
least-squares estimate of M is obtained via the singular value decomposition of Y, yielding an 
estimate that can have a very high variance. In this paper we suggest a model-based alternative 
to the above approach by providing prior distributions and posterior estimation for the rank 
of M and the components of its singular value decomposition. In addition to providing more 
accurate inference, such an approach has the advantage of being extendable to more general 
data- analysis situations, such as inference in the presence of missing data and estimation in a 
generalized linear modeling framework. 

Some key words: interaction, model selection, multiplicative effects, multiple hypergeometric 
function, relational data, social network, Steifel manifold. 

1 Introduction 

Many inferential and descriptive methods for multivariate and matrix- valued data are variations on 
the idea of modeling the data matrix Y as equal to a reduced-rank mean matrix M plus Gaussian 
noise, and estimating M after deciding upon its rank. A concept that is central to such models is the 
singular value decomposition: every mx n matrix M has a representation of the form M = UDV' 
where, in the case m > n, U is an m x n matrix with orthonormal columns, V is an n x n matrix 
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with orthonormal columns and D is an n x tt, diagonal matrix, with diagonal elements {di, . . . ,dn} 
typically taken to be a decreasing sequence of non- negative numbers. The triple {U, D,V} is 
called the singular value decomposition of M. The squared elements of the diagonal of D are the 
eigenvalues of M'M and the columns of V are the corresponding eigenvectors. The matrix U can 
be obtained from the first n eigenvectors of MM'. The number of non-zero elements of D is the 
rank of M. 

The appeal of the singular value decomposition is partly due to its interpretation as a multi- 
plicative model based on row and column factors. Given a model of the form Y = M + E, the 
elements of Y can be written yij = u'-'Dvj+Cij, where Uj and vj are the ith and jth rows of U and 
V respectively. This model thus provides a representation of the systematic variation among the 
entries of Y by row and column factors. Models of this type play a role in the analysis of relational 
data (Harshman et al., 1982), biplots (Gabriel 1971, Gower and Hand 1996) and in reduced-rank 
interaction models for factorial designs (Gabriel 1978, 1998). The model is also closely related 
to factor analysis, in which the row vectors of Y are modeled as i.i.d. samples from the model 
Yj = UjDV' + Bi- In this situation, the goal is typically to represent the covariance across the n 
columns by their relationship to K < n unobserved latent factors. Lawley and Maxwell (1971) pro- 
vide a comprehensive overview of factor analysis models, as well as methods of maximum likelihood 
estimation for model parameters. 

The singular value decomposition also plays a role in parameter estimation for the reduced- 
rank model: Assuming the rank of the mean matrix M is < n and letting (U,D, V) be the 
singular value decomposition of the data matrix Y, the least-squares estimate of M (and maximum 
likelihood estimate under Gaussian noise) is given by M^' = ^li:K]i^[i:K,i:K]^[,i:K]j where U[^i:i^] 
denotes the first K columns of U and D[i:/^j:/^] denotes the first K rows and columns of D 
(Householder and Young 1938, Gabriel 1978). In applications such as signal processing, image 
analysis, and more recently large-scale gene expression data, representing a noisy data matrix Y 
by with K « n has the effect of capturing the main patterns of Y while eliminating much 
of the noise. 

Despite its utility and simplicity, several issues limit the use of the singular value decomposition 
as an estimation procedure. The first is that the rank K of the approximating mean matrix M^^ 
must be specified. Standard practice is to plot the singular values of Y in decreasing order and then 
select K to be the index where the last "large gap," or "elbow" occurs. The second issue is that, 
even if the rank is chosen correctly, one may be concerned with its variance: For high-dimensional 
parameter spaces, the mean squared error of least-squares estimates is often higher than that of 
penalized or Bayes estimates. Finally, non-model-based approaches are somewhat limited in their 
ability to handle missing or non-normal data. 

Philosophical debates aside, Bayesian methods often provide sensible procedures for model se- 
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lection and high-dimensional parameter estimation. For the model described above, a Bayesian 
procedure would provide a mapping from a prior distribution p(U, D,V, cr^) to a posterior dis- 
tribution p(U, D, V, cr^|Y). Of primary interest might be functions of this posterior distribution, 
such as £^(M|Y) or the marginal posterior distribution of the rank p(i^|Y) oc p{K)p(Y\K). Both 
of these quantities require integration over the complicated, high-dimensional space of {U,D,V} 
for each value oi K. In the related factor analysis model where the elements of U arc modeled as 
independent normal random variables, the difficulty in calculating marginal probabilities has led to 
the development of approximate Bayesian procedures: Rajan and Rayncr (1997) provide a coarse 
approximation to the marginal probability p(Y\K) by plugging in maximum-likelihood estimates. 
Minka (2000) improves on this by providing a Laplace approximation to the desired marginal prob- 
ability. Both of these procedures rely on asymptotic approximations, and do not provide Bayesian 
estimates of M once the dimension has been selected. 

Alternatively one could turn to Markov chain Monte Carlo methods to obtain approximations to 
the integrals that are necessary for model selection. Tierney (1994) describes the theory of MCMC 
for very general parameter spaces, and Green (1995) outlines conditions under which the reversibil- 
ity of the Metropolis-Hastings algorithm is maintained for model-selection problems. However, 
obtaining efficient proposal distributions for high-dimensional problems is a difficult and delicate 
art. For the factor analysis problem. Lopes and West (2004) devise a workable reversible-jump 
algorithm by constructing proposal distributions that approximate the within-model full condi- 
tional distributions of the model parameters. In this way, their approach mimics aspects of the 
Gibbs sampler. However, their approximate full conditional distributions are derived from auxiliary 
within-model MCMC algorithms, requiring an extra Markov chain for each rank K to be consid- 
ered. As a result, this approach requires pre-specification of the values of K, leaving open the 
possibility that computational efi"ort is spent on values of K with negligible posterior probability, 
or that values of K with high posterior probability arc overlooked altogether. 

In the analysis of relational data such as social and biological networks, the row heterogeneity 
and column heterogeneity of Y are of equal interest, thus motivating a singular value decomposition 
approach to the data analysis as opposed to a factor analysis. The purpose of this paper is to provide 
the necessary calculations for model selection, estimation and inference for statistical models based 
on the singular value decomposition. The results in the following sections provide a means of Markov 
chain Monte Carlo approximation to the posterior distribution of the rank and values of the matrix 
M. This MCMC algorithm is based on Gibbs sampling and, unlike the algorithm of Lopes and 
West (2002), requires no auxiliary runs or pre-specification of the values of K. Additionally, this 
model-based method allows for estimation in the presence of missing data or replications, and can 
be incorporated into a generalized linear modeling framework to allow for the analysis of a variety 
of data types. For example. Section 6 discusses a model extension that allows for the analysis and 
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prediction of binary relational data such as social or biological networks. 

In the next section we discuss prior distributions for {U, D,V} given a fixed rank K, and 
show how the uniform distribution for U (the invariant measure on the Steifel manifold) may be 

specified in terms of the full conditional distributions of its column vectors. Section 3 presents 
a Gibbs sampling scheme for parameter estimation when the rank of M is specified. In the case 
of unspecified rank, estimation can be achieved via a prior distribution which allows the diagonal 
elements of D to each be zero with non-zero probability. In Section 4 we consider posterior inference 
under such a prior distribution, and develop a Markov chain Monte Carlo algorithm which moves 
between models with different ranks. This algorithm is constructed via a Gibbs sampling scheme 
which samples each singular value dj from its conditional distribTition. This is done marginally over 
U[j] and V[ j], and requires a complicated but manageable integration. Section 5 presents a small 
simulation study that examines the sampling properties of the Bayesian procedure. It is shown 
that the procedure is able to estimate the true rank of M reasonably well for a variety of matrix 
sizes, and the squared error of the Bayes estimate £'(M|Y) is typically much lower than that of 
the least squares estimator. In Section 6 the singular value decomposition model is extended to 
accommodate non-normal data via a generalized linear modeling framework, which is then used in 
an example analysis of binary relational data. A discussion follows in Section 7. 

2 The SVD model and prior distributions 

As described above, our model for an m x n data matrix is Y = M + E, where M is a rank K matrix 
and E is a matrix of i.i.d. mean-zero normally-distributed noise. We induce a prior distribTition on 
the matrix M by way of a prior distribution on the components of its singular value decomposition 
{U,D,V}. 

For a given rank K, we can take U to be an m x ii' orthonormal matrix. The set of such matrices 
is called the Steifel manifold and is denoted VK,m- A natural, non-informative prior distribution 
for U is the uniform distribution on VK,m, which is the unique probability measure on VK,m that is 
invariant under left and right orthogonal transformations. As discussed in Chikuse (2003, Section 
2.5), a sample U from the uniform distribution on the Steifel manifold VK,m may be obtained 
by first sampling an m x K matrix X of independent standard normal random variables and then 
setting U = X(X'X)~^/^. Although this construction is straightforward, it doesn't explicitly specify 
conditional distributions of the form p(U[ j]|U[ _j]), which are quantities that will be required for 
the estimation procedure outlined in Section 3. We now derive these conditional distributions via 
a new iterative method of generating samples from the uniform distribution on VK,m- 

Let U[^^] denote the columns of U corresponding to a subset of column labels ^ C {1, . . . , K}, 
and let be any m x (m — \A\) matrix whose columns form an orthonormal basis for the null 
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space of U[^^]. A random U G VK,m can be constructed as follows: 

1. Sample Ui uniformly from the unit m-sphere and set y = ui; 

2. Sample U2 uniformly from the unit (m — l)-sphere and set U[ 2] = N|i}U2; 

K. Sample uk uniformly from the unit {m — K + l)-sphere and set U[ = Nji j^_i}Ui^. 

By construction this procedure generates an m x K matrix U having orthonormal columns. The 
following result also holds: 

Proposition 1 The probability distribution o/U is the uniform probability measure on VK,m- 

A proof is provided in the Appendix. Since this probability distribution is invariant under left and 
right orthogonal transformations of U (see, for example, Chikuse 2003), it follows that the rows and 
columns of U are exchangeable. As a result, the conditional distribution of U[ j] given any subset 
A of columns of U is equal to the distribution of N/iUj, where Uj is uniformly distributed on the 
{m — |^|)-sphere. This fact facilitates the Gibbs sampling of the columns of U and V from their 
full conditional distributions, as described in Section 3. For simplicity we proceed with posterior 
calculations using this uniform prior, even though the full conditionals derived in Section 3 indicate 
that a more general, informative conjugate family could be used with no additional computational 
difficulty. 

For a given rank the non-zero singular values {di, . . . , dx} which make up the diagonal 
of D determine the magnitude of the mean matrix, in that ||M|p = 'Ylik=i'^\- model these 
non-zero values as being samples from a normal population with mean /i and precision (inverse- 
variance) For reasons of conciseness we discuss posterior calculations only for conjugate prior 
distributions on these parameters, which include a normal distribution with mean and variance 
Vq for //, and a gamma(7yo/2, ?yoTQ/2) distribution for ijj, parameterized so that •i/' has expectation 
I/tq. Other potentially useful prior distributions, along with choices for hyperparameters, are 
discussed in Section 5. This parameterization of the singular values differs slightly from that 
of the usual singular value decomposition, in that the values {di, . . . , cix} are not restricted to 
be non-negative here. A model enforcing this restriction is possible, but adds a small amount 
of computational difficulty without any modeling benefit (if A is a diagonal matrix of ibl's, then 
p(Y|U, D, V) = p(Y|UA, AD, V)). Finally, the elements of E are modeled as i.i.d. normal random 
variables with mean zero and variance 1/0. The prior distribution for the precision (p is taken to be 
gamma(i/o/2, ^00^/2). A graphical representation of the model and parameters is given in Figure 
n Choices for hyperparameters {(/io,fo)) ivo^'^o)' (^Oj^^'o)} discussed in Section 5. 
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uniform(Vi^,m) ^ U 

normal(/i, iZ-i/;) ~ D ►{Y = UDV' + E} " — E ~ normal (0, !/</>) 

uniform(Vi4', V 

Figure 1: A graphical representation of the model 

3 Gibbs sampling for the fixed-rank model 

A Markov chain with p(\J, D, V, 0, fi, ip\Y, K) as its stationary distribution can be constructed via 
a Gibbs sampling procedure, which iteratively samples (j), /i, ifj and the columns of U, D and V from 
their full conditional distributions. These samples can be used to approximate the joint posterior 
distribution and estimate posterior quantities of interest (see, for example, Tierney 1994). 

The full conditional distributions for (p^ /i, ip and the elements of D are standard and are provided 
below without derivation. Less standard are the full conditional distributions of the columns of 
U and V. To derive these, consider the form of p(Y|U, D, V, i;^)) as a function of U[j], Vjj] and 
dj = D[j,j]- Letting E_j = Y — U[^_j]D[_j _j]V| we have 

||Y-UDV'||2 = ||E_,-d,U[,,]V|_^.]||2 

= ||E_,|p - 2(i,Uj^^.]E^,V[,,.] + \\d,\J^,^Y[J\^ 
= ||E_,||2-2d,Uj^^.]E_,V[,,.]+4. 

It follows that p(Y|U, D, V, (p) can be written 

(, \ mn/2 -j -j 

£j exp{--0||E_,||2 + pd,lJ[.^-E.,Yy] - -<t>d]}. (1) 

Recall that conditional on Uj_j], U[ j] = N|_^.|Uj, where is a basis for the null space of 

columns of _j] and Uj is uniform on the m — {K — l)-sphere. From we see that the full 
conditional distribution of \ij is proportional to exp{u^((;/)(ijN|l^|E_j V[ j])}. This is a von Mises- 
Fisher distribution on the m — {K — l)-sphere with parameter (/)(ijN|^^|E_j V[ jj. A sample of 
from its full conditional distribution can therefore be generated by sampling Uj from the von Mises- 
Fisher distribution and then setting U[j] = N|_^.|Uj. The full conditional distribution of V[ jj is 
derived similarly. In general, the von Mises-Fisher distribution on the p-sphere with parameter 
/X S has density Cp(||/2||) expju'/x} and is denoted vMF(/2), and the uniform distribution on the 
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sphere is denoted vMF(O). The normahzing constants for these two cases are 



Cp{K) = (27r)-f/2 



.p/2-1 



r(p/2) 

27rP/2 



for At > 0, Cp(0) = 



for K = 0, 



where Iu{x) is the modified Bessel function of the first kind (see Section 9.6 of Abramowitz and 
Stegun, 1972). R-code for samphng from this distribution is provided at my website. 

Summarizing these results, a Markov chain with the desired stationary distribution can be 
constructed by iterating the following procedure: 

• Fori G {!,..., K}, 

- sample (U[j]|Y, U[^_^], D, V, </>) = N]^_^.jUj, where u^- ~ vMF((^djN^^E_j-V[j]); 

- sample (V[j] | Y, U, D, V[,_j-], </>) = N^'.^.^v^-, where ~ vMF((/)djUj^^.]E_jN|_^.|); 

- sample (dj|Y, U, D[_j V, 0, /x, V') ^ normal [(U|^^.]E_jV[j](/>+/iV)/(?^'+V'), l/(</'+'0)]; 

• sample (^|Y,U,D,V) ~ gamma[(z/o + mn)/2, (z/qct^ + ||Y - UDV'|p)/2]; 

• sample (At|D,V') ~ xioixaal[{'il}Y, dj + iiq/vI)/ {'iIjK + l/vl),l/ {'iIjK + l/vl)] 

• sample (V'|D, /x) ~ gamma[(r7o + K)/2, {rjoT^ + ^{dj - /x)2)/2]; 

4 The variable-rank model 
4.1 Prior distributions 

In this section we extend the model of Section 2 to the case where the rank K is to be estimated. 
This requires comparisons between models with parameter spaces of different dimension. Two 
standard ways of viewing such problems are as follows: 

• Conceptualize a different parameter space for each value of K, i.e., conditional on K, the 
mean matrix is UDV' where the dimensions of U, D and V are m x K, K x K and n x K 
respectively. 

• Parameterize U, D and V to be of dimensions m x n, n x n and nx n, but allow for columns 
of these matrices to be identically zero. In this parameterization, K = X^j=i i{dj 7^ 0). 

Each of these two approaches has its own notational and conceptual hurdles, and which one to 
present is to some extent a matter of style (see Green 2003 for a discussion). Given a prior 
distribution on K, the first approach can be formulated by using the prior distributions of Section 
2 as the conditional distributions of U, D and V given K. The second approach can be made 
equivalent to the first as follows: 
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1. Let U, D, V have the prior distributions described in Section 2 with K = n; 

2. Let {si, . . . , Sn} p{K = Sj) x (^"^ .) ^ , where each Sj G {0, 1}; 

3. Let S = diag{si,...,s„}. Set U = US,D = DS, V = VS, K = Ysj. 

Parameterizing a set of nested models with binary variables has been a useful technique in a variety 
of contexts, including variable selection in regression models (Mitchell and Beauchamp 1988). We 
continue with this formulation because it allows for the construction of a relatively straightforward 
Gibbs sampling scheme to generate samples from the posterior distribution. 

The matrices U, D and V described in 1, 2 and 3 above are exchangeable under simultaneous 
permutation of their columns. It follows from Proposition 1 that, conditional on the 
non-zero columns of U and V are random samples from the uniform distributions on sj,m 
Vj2sj,n respectively, and that conditional on {sj = 1, U[ V[ 

U[.] = ^i-j}^, V[,] £ N|_^.jv, where 

• ^{-j} ^^'^ ^{-j} orthonormal bases for the null spaces of U[ _j] and V[ 

• u and V are uniformly distributed on the {m — Yl + 1)- and {n — Yl + l)-spheres. 

This property will facilitate posterior sampling of the columns of U, D and V, as described in the 
next subsection. 

4.2 Posterior estimation 

Let = {U,D, V}, &j = {U[ j], dj, V[ j]} and & j = {@k '■ ^ j} ■ this subsection we derive 
the full conditional distribution of 0j given {Y, 0_j, 0, /i, i/;} under the model described in the 
previous subsection. The prior and full conditional distributions of cp, fi and remain unchanged 
from Section 2. The full conditional distributions can be used in a Gibbs sampling scheme to 
generate approximate samples from p(U, D, V, (p, /n, V'|Y). 

Under the model and parameterization described above, the components of @j are either all 
zero or have a distribution as described in Section 2. To sample @j, we first sample whether or 
not the components are zero, and if not, sample the non-zero values. More specifically, sampling 
@j from its full conditional distribution can be achieved as follows: 

1. Sample from {{dj = 0}, {dj ^ 0}) conditional on Y, ^A, M, V' • 

2. If {dj = 0} is true, then set dj,\Jyj and V[ all equal to zero. 

3. If {dj / 0} is true, 

(a) sample dj\Y, &-j, (f), fi, ip, {dj ^ 0}; 
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(b) sample {U[j],V[j]}\Y,@^j,(j),dj. 

The steps 1, 2, and 3 above constitute a draw from p{@j\Y,&^j,(j), fJ,,^p). The first step requires 
calculation of the odds: 

oddsfa,- 7^ Y , 0_,-, ©, it, w) = ,^ , x — 7 ^ — (2) 

The first ratio is simply the prior conditional odds of {dj 7^ 0} and can be derived from the prior 
distribution on the rank K. The second term in ^ can be viewed as a Bayes factor, evaluating the 
evidence in the data for additional structure in -^[Y] beyond that provided by Recall from 

the previous section that Y — UDV' = E_j — (ijU[ jjVj^-j, and so we can write 



p(Y|U,D,V,(/>,^,^) 



mn/2 -. 



exp{-^(/>4}exp{</>(ijUj,^.]E_jV[j]} 



2 



= p{Y\&^j,dj = 0,0,//,^) X exp{--(l)d]} X exp{(/)a!jU'^^.]E_jV[j]} (3) 

The first term in is equal to the denominator of the Bayes factor, and is simply a product 
of normal densities with the elements of Y having means given by U[ _j]D[_j _j]Vj and equal 
variances 1/0. The numerator of the Bayes factor can be obtained by integrating (jS)) over 0^ with 
respect to its conditional distribution given fi,'ip,&-j and {dj 7^ 0}. Integrating first with respect 
to U[ J], V[ j], we need to calculate i?[exp{(/)djUj j]E_jV[ j]}|0_j, dj]. Let fh = m — Ylkj^ji^k / 0} 
and h = n — Ylk:^j{^k 0}- Recall that conditional on 0-j, U[ j] = N|__^|U and V[ = N|_^|V 
where u and v are uniformly distributed on the rh- and n-spheres. Letting E = N|^^|E_-,N|_^|, 
the required expectation can therefore be rewritten as Euv[exp{(/)(iju'Ev}]. This expectation is 
non-standard, and is derived in the appendix. The result gives: 

^ 00 

p(Y|0_,-,,/.,//,V',dj) =p(Y|0_,-,</),/x,V',di =0) xexp{--</.4}^||E||VS% (4) 

1=0 

where the sequence {ai}"^ can be computed exactly and is given in the appendix. 

The calculation of p(Y|0_j, cp, /i, ip, dj 7^ 0) is completed by integrating @ over dj with respect 
to p{dj\@^j,(p, fijtpjdj 7^ 0), the normal density with mean and precision ■0. This integration 
simply requires calculating the even moments of a normal distribution, resulting in 

00 

p(Y|0_,-,0,V,di 7^0) =p(Y|0_j,(/),0,d, =0) X ^||E||2'aj6; (5) 



1=0 



where the sequence {6;}g° is given by 
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where Z is standard normal. The required moments can be calculated iteratively, see for example 
Smith (1995). The conditional odds of {dj ^ 0} is therefore 

odds(d, / O|Y,0_,,^,/.,V) = y^Y^mfaih. 

p{dj = O|0_j) ^ 

In practice, only a finite number of terms can be used to compute the above sums. The sum in 
(jlj can be bounded above and below by modified Bessel functions, and the error in a finite-sum 
approximation can be bounded. This can also provide a guide as to how many terms to include in 
approximating ©. Details are given in the Appendix. 

If {dj 7^ 0} is sampled it is still necessary to sample dj,U[ j] and V[j]. Multiplying equation 
Q by the prior for dj \ {dj ^ 0} , the required conditional distribution for dj is proportional to 

oo 

p^-|Y,0_j-,0,/i,^,{dj / 0}) oc e-^('='^-'^)''^e-^'^?'^^||E||2V2'dfaz 

1=0 

which is an infinite mixture with the following components: 

• mixture weights: wi oc ||E|p'ai6; 

• mixture densities: fi{d) oc d^' exp{— ^((i — /x)^^'}, where /x = fj-^p/{(j) + V') and ip = cp + 

The density fi{d) is nonstandard, but can be sampled from quite efficiently using rejection sam- 
pling with a scaled and shifted i-distribution as the approximating density (the tails of a normal 
distribution are not heavy enough). 

To sample and ^[j] we first sample u and v from their joint distribution and then set 
U[j] = N|_^.|U and V[ = N|_^.|V. Equation (jSJ indicates that the joint conditional density of 
{u, v} is of the form 

p(u, v| Y, &-j, 4>, fi, ip, dj) = c(A) expju'Av}, (6) 

where A = (pdjE and c{A)^^ = Cm(0)^"'^Cfi(0)^^ ||A|p'ai. This density defines a joint dis- 
tribution for two dependent unit vectors. A similar distribution for dependent unit vectors was 
discussed in Jupp and Mardia (1980), except there the vectors were of the same length and the 
matrix A was assumed to be orthogonal. Some useful facts about the distribution in © include 

• the conditional distribution of u|v is vMF(Av), and that of v|u is vMF(u'A); 

• the marginal distribution of v is proportional to I^/2-i(ll-^v||)/||Av||™/^~^; 

• the joint density has local maxima at {ib(ufc, v^), = l,...,n} where (ufc,Vfc) are the kth 
singular vectors of A. 
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I have implemented a number of rejection and importance samplers for this distribution, although 
making these schemes efficient is still a work in progress. A relatively fast approximate method 
that seems to work well for a variety of matrices A is to first sample (u, v) from the local modes 
{±(ufc, Vfc), /c = 1, . . . ,n} according to the exact relative probabilities and then use this value as 
a starting point for a small number of Gibbs samples, alternately sampling from p(u|A,v) and 
p{v\A,u). 

We now outline a Gibbs sampling algorithm that moves within and between models of different 
dimensions, using the conditional distributions derived above and in Section 3: 

A. Variable dimension sampler: For each j G {1, . . . ,n}, sample @j = {U[ j], dj, V[j]} : 

• sample dj|Y, 0_j, ; 

• sample [IJ y])\Y, & (/), ii,ijj,dj. 

B. Fixed dimension sampler: For each {j : dj ^ 0}, 

• sample U[j]|Y, S j, cj), fi, ijj, dj, V[j]; 

• sample V[ j]|Y, 0_j, (/), yU, V', dj, U[j] ; 

• sample dj|Y, 0_j, 0, /x, V', U[j], V[j]. 

C. Other terms: 

• sample (/)| Y, ; 

• sample /u|D, ip; 

• sample jj,. 

The distributions required for the steps in A are outlined in this section, and steps B and C are 
outlined in the previous section. Technically, interating steps A and C alone will produce a Markov 
chain with the desired stationary ditribution, but adding the steps in B can improve the within- 
model mixing of the Markov chain at a relatively low computational cost. Also, by conditioning 
on whether or not dj = for each j, the steps in B can be viewed as Gibbs sampling for all 
j e {1, . . . ,n}, not just those for which dj / 0. R-code that implements this algorithm is available 
at my website. 

5 Simulation study 

In this section we examine the sampling properties of the estimation procedure with a small simu- 
lation study. Each dataset in this study was simulated from the following model: 

• U ~ uniform(V5,TO), V ~ uniform(V5,n) ; 
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• D = diag{(ii, . . . , d^}, {di, . . . , ^5} ~ i.i.d. uniform 

• Y = UDV' + E, where E is an m x n matrix of standard normal noise. 

For each value of m and n, the sampling mean of {di , . . . , ^5} was taken to be Hmn = \/n + m + 
Such a value should distribute the singular values {di, . . . , d^} near the "cusp" of detectability: As 
shown in Edelman (1988), the largest singular value of an m x n matrix E of standard normal noise 
is approximately /x^n for large m and n. 

Three-hundred datasets were generated using the model above, one-hundred for each of the three 
sample sizes (m,n) G {(10, 10), (100, 10), (100, 100)}. These were generated in the R statistical 
computing environment using the integers 1 through 100 as random seeds for each of the three 
sample sizes. Code to generate these datasets is available from my website. Prior distributions 
for the parameters {(f), fi, ip} were taken as described above with "prior sample sizes" of 1^0 = 2 
and r/o = 2. This gives exponential prior distributions for (p and ip. The values of (7q,^q and Tq 
were derived from "empirical Bayes"-type estimates obtained by averaging over different ranks as 
follows: 

1. For each k G {0, . . . ,n}, 

(a) Let UDV be the least-squares projection of Y onto the set of rank-A; matrices; 

(b) Let <t2 = ||Y - UDV llV(nm) 

(c) Let A, = E ■=! dj/k, fl = T!]-M - kVk. 

2. Let <yl = ^ Y.']=o <7|, /^o = ^ E"=o Ai, vl = l E]=oiH " A)^ ^0 = ^ Ei=o ^j- 

The resulting prior distributions are weakly centered around averages of empirical estimates, where 
the averaging is over ranks through n. Finally, the prior distribution on the rank K of the mean 
matrix was taken to be uniform on {0, . . . , n}. 

For each of the 100 x 3 datasets, 20, 000 iterations of the Gibbs sampling scheme described 
in Section 4.3 were run to obtain approximate samples from the posterior distribution of UDV'. 
Parameter values were saved every 10th scan after dropping the first 10,000 scans to allow for 
burn-in, resulting in 1000 Monte Carlo samples per simulation. All Markov chains were begun with 
K = and {(/>, n, tp} set equal to their prior modes. Summaries of the posterior distributions for the 
three different values of {m,n) are displayed in Figure |21 The first column of each panel plots the 
MCMC approximation to the expected value of p{K\Y) for each value of {m,n). The expectation 
Ey[p{K\Y)] is approximated by j^J2l^iPi'^\^{s))j where Y(5) is the sth simulated dataset for 
a given value of (m,n) (for the case m = n = 100, p{K\Y) is plotted only for K < 10, although 
the distribution extends beyond this value). These distributions are all peaked around the correct 
value of = 5. 
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Figure 2: Results of the simulation study. Plots in the first column give the averages of p{K\Y) 
over 100 simulated datasets. The second column gives the empirical distribution of the posterior 
mode K. The third column gives the distribution of the ratio of the squared error of the Bayes 
estimate of M to that of the least-squares estimate. 
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Estimator 


Sample size 


k 


ke 


kc 


ki 


(10,10) 


5, .35 


2, .00 


4, .14 


1, .00 


(100,10) 


5, .66 


3, .09 


4, .27 


5, .46 


(100,100) 


5, .45 


6, .21 


36, .00 


4, .24 



Table 1: Comparison of model selection procedures. Mode of the sampling distribution of K and 
the probability that K = 5 for different estimators of the rank of M. 

Also of interest is how frequently the posterior mode K = arg maxp(K\Y) obtains the true value 
of K. This information is displayed in the second column of Figure |21 which gives the empirical 
distribution of K taken over each of the 100 datasets. As we see, the true value K = 5 is the 
most frequent value of the estimate in each dataset. For each dataset we also computed three other 
estimates of K: Ki, the Laplace approximation of Minka (2000); kc, the number of eigenvalues of 
the correlation matrix of Y that are larger than 1, often suggested in the factor analysis literature; 
and Ke, the index of the largest gap in the eigenvalues of Y'Y, used in machine learning and 
clustering (see, for example, Meila and Xu 2003). Descriptions of the sampling distributions of these 
estimators are presented in Tabled The only case in which the peak of the sampling distribution 
for one of these estimators obtained the correct value was ki in the case of m = 100, n = 10, 
although the sampling distribution was less peaked around the true value than that of the Bayes 
estimate {Pr{Ki = 5|Y) = .46 versus Pr(i^ = 5|Y) = .66). In the case of m = 10,n = 10, Ki did 
poorly, with Pr(/C/ = 1|Y) = .86. Finally, we note that kc behaved extremely poorly for the case 
m = n = 100, having a sampling distribution centered around K = 36. This is perhaps due to the 
fact that the asymptotic behavior of eigenvalues for square matrices are quite different than that 
of rectangular matrices (see Edelman, 1988). R-code to obtain all estimators for these simulated 
data are available at my website. 

Perhaps of more importance than an estimate of k is an accurate estimate of M. In the last 
column of Figure [21 we compare the error of the model-averaged estimate of M to that of the least- 
squares estimate. For each simulated dataset the posterior mean M = i?[M|Y] was obtained by 
averaging its value over the last 10^ scans of the Gibbs sampler. The squared error in estimation, 
averaged over elements of the mean matrix was calculated as ASE^ = ||M — M| p/(mn) where 
M is the mean matrix that generated the data. This value is compared to ASE^.s', which is the 
corresponding average squared error of the least-squares projection of Y onto the space of rank-X 
matrices. The distribution of this ratio is mostly below 1 for the case m = n = 10, and strictly 
below 1 for the other two cases where there are more parameters to estimate. This corresponds 
with our intuition: The model-averaged estimates improve relative to the least-squares estimates 
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as the number of parameters increases. These results indicate that simply obtaining a posterior 
estimate K of K and then using the corresponding lank-K least-squares estimate of M generally 
results in an estimate that can be substantially improved upon by model averaging, at least in 
terms of this error criterion. 

As a simple summary of the mixing properties of these Markov chains, we examined the conver- 
gence and autocorrelation of the marginal samples of the error precision using Geweke's (1992) 
z-test of stationarity, and by calculating the effective sample size of the 1000 Monte Carlo samples 
saved for each chain. We declared a Markov Chain simulation a "success" if Geweke's z-statistic 
did not exceed 2 in absolute value and if the effective sample size was at least 100. Based on this 
criterion, the percentage of MCMC simulations that were successful was 82%, 86% and 91% for the 
three sample sizes {(10, 10), (100, 10), (100, 100)} respectively. The simulation "failures"' were not 
examined extensively, but we conjecture that running these chains longer is likely to result improved 
estimation. Of course, in the analysis of a single dataset the usual recommendation is to assess the 
convergence and autocorrelation of the Markov chain and to adjust the length accordingly. 

Finally, the performance of the model was examined using two additional prior distributions. 
The posterior analysis was first rerun using a "diffuse" conjugate prior distribution, in which <p ~ ex- 
ponential(l), ip ~ exponential(l) and /x ~ normal(0, The modes of the sampling distribution 
of K were 6, 5 and 5 for the sample sizes (10,10), (100,10) and (100,100) respectively, with Pt{K = 
5) equal to .21, .53 and .32 for the three different cases, indicating slightly worse performance than 
the empirical Bayes prior distribution, but better performance than the other approaches. A modi- 
fication to this prior was also implemented, in which n ~ normal(0~^/^ y^n + m + 2^nm, l/ip) and 
the other distributions remained the same. This prior distribution focuses the search for non-zero 
dj's to values that are as large as the largest singular values of normally distributed noise matrices. 
Such a prior distribution may be appealing in practice, as it requires that factors entering into the 
model have a reasonable magnitude. Use of this prior distribution resulted in sampling distribu- 
tions for K having modes of 5 for each of the sample sizes, with Pr{K = 5) equal to .32, .55, .42. A 
more complicated alternative to these prior distributions would be to have the prior distributions 
for {(/), /X, tp} depend on K. For example, given K = k, the prior distributions for {cp, /x, tp} could 
be based on {<7^, Afei'^^fe}- Such prior distributions would require some minor modifications to the 
variable-dimension sampler outlined in the previous section. 
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6 Extension and example: analysis of binary relational data 



A potentially useful extension of the model described above is to a class of generalized bilinear 
models of the form 

where g is the link function. Such models allow for the analysis of a variety of data types: 
For example, binary data can be modeled as Uij ~ binary ( i^^^g. \ | ) and count data as yij ~ 
Poisson(exp{0jj}). Gabriel (1998) considered maximum likelihood estimation for a variant of this 
model in situations where the dimension of D is fixed, and Hoff (2005) considered a symmetric 
version of this model for the analysis of social network data. Parameter estimation and dimension 
selection for the above model can be made by sampling from a Markov chain generated by a mod- 
ified version of the algorithm of Section 4.3. Given current values of 0, /3, U, D, V, sample new 
values as follows: 

1. Let Y = — X/3 = UDV' + E. Update U, D, and V from their conditional distribution 
given Y as described in Section 4.3. 

2. Let Y = — UDV' = X/3 + E. Update /3 from its conditional distribution given Y (a 
multivariate normal distribution). 

3. Sample 0* = X/3 + UDV' + E*, where E* is a matrix of normally distributed noise with 
zero mean and precision (j). Replace 6ij by 9* j with probability 1- 

We illustrate the use of such a model and estimation procedure with an analysis of binary 
relational data between 46 global service firms and 55 cities, obtained from the Globalization and 
World Cities study group (|http : / / www . Ibor o . ac . uk/ gawc ) . For these data, yij = 1 if firm j 
has an office in city i and yij = otherwise. Standard practice is to represent within-row and 
within-column homogeneity with effects that are additive on the log-odds scale: 

logodds(yij = 1) = P + ai + bj, (7) 

and so the effects a = {ai, . . . , am} and b = . . . , bn} constitute a rank-two structure. We look 
for evidence of higher-order structure by considering the model 

logodds(yij = 1) = P + (8) 
7ij = u/Dvj + 

The rank-two structure of model ((Tj) is easily incorporated into © by fixing Uj = -^1,^x1 and 
Vr 91 = ^ Inxi and modeling di and d2 to be non-zero with probability 1. The additive city and 
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Figure 3: Posterior estimation of K. The first panel plots values of K every 100th scan of the 
Markov chain. The second panel plots the Monte Carlo estimate of p{K\'Y). The third panel gives 
the results of a cross-validation evaluation of E {0, . . . , 10}. 

firm effects are then given by a = d2U[ 2] and b = diVj x] respectively. Note that any remaining 
effects represented by UDV' will be orthogonal to these additive effects, and that the mean of 
the matrix UDV' is identically zero, making it unaliased with the intercept p. For the remainder 
of this analysis, the variable K will refer to the number of additional non-zero singular values of 
UDV' beyond the additive row and column effects. 

We fix the error variance l/i;^> = 1 , as this scaling parameter is confounded with the magnitude 
of f3 and UDV'. For simplicity we use independent normal (0, 100) prior distributions for (3 and the 
non-zero elements of D, and a uniform prior distribution for K. A Markov chain of length 25,000 
was constructed using the algorithm described above, starting with K = 0. Mixing across ranks K 
was rapid as is shown in the first panel of Figure 01 which displays values of K every 100th scan 
of the Markov chain. Stationarity of the Markov chain in K was not rejected at level 0.05 based 
on Geweke's z-test, and the effective sample size for estimating the posterior distribution of K was 
472. The Monte Carlo estimate oi p{K\y), shown in the second panel, gives a posterior mode of 
K = 6 and suggests strong evidence for structure in the log-odds beyond that of the additive row 
and column effects. 

One of the practical motivations for selecting an appropriate model dimension is prediction. 
Many binary social network datasets include missing values, in which it is not known whether 
Uij = 1 or Uij = 0. In such cases it is often desirable to make predictions about missing values 
based on the observed data, and thus to base model selection on predictive performance. With this 
in mind, we compare the above results to the following 10-fold cross validation procedure: 



1. Randomly split the set of pairs into ten test sets Ai, . . . , Aiq. 
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2. FoT K = 0,1,... ,K, 



(a) For / = 1, . . . , 10 : 

i. With the rank fixed at K, perform the MCMC algorithm using only {yij : 
Ai}, but sample values of 9ij for all ordered pairs. 

ii. Based on the Monte Carlo sample values {^j- j\ • • • > j^} compute the posterior mean 
Aij = s Ss=i {hj} £ A and the log predictive probability \pp{Ai) = 

^{^d}eA^ logP(yi,ilAi,i)- 

(b) Measure the predictive performance for K as LPP(i^) = lpp{Ai). 

The values of — 2LPP(i^) for K £ {0, . . . , 10} are shown in the third panel of Figure (jSJ. For 
the particular random partitioning of the data used here, the cross-validation procedure suggests 
a model rank of K = 6, which is the same value as the posterior mode of the Bayes solution. 
However, a comparison of values of K using a ten-fold cross validation procedure requires the 
construction of 10 x separate Markov chains, and further requires specification of the values of 
K to be compared. In contrast, the Bayesian procedure requires only one MCMC run and can 
potentially visit each value of K £ {1, . . . ,n}. 

Finally we examine some of the patterns in the structure of UDV' beyond those of the additive 
effects. The posterior mean of UDV, minus the additive effects, was obtained by averaging over 
scans of the Markov chain. The first two singular values and vectors of this matrix were obtained, 
and the values of the resulting row (city) effects are plotted in Figure 0J These values are strongly 
related to geography: U.S. cities cluster together, as do cities in Europe, Latin America and from 
the Pacific rim. 

7 Discussion 

This paper has presented a model-based version of the singular value decomposition, thereby ex- 
tending a general data analysis tool that has a wide variety of data analysis purposes and interpre- 
tations. For example, in this article it is used for noise reduction and estimation of a mean matrix 
(Section 5), as well as prediction and data description (Section 6). 

The approach taken in this paper is to model the data matrix Y as equal to a reduced-rank 
mean matrix M plus Gaussian noise, and to simultaneously estimate M along with its rank. The 
approach is Bayesian and the estimation procedure, based on Markov chain Monte Carlo, allows 
for a variety of model extensions, such as to the generalized bilinear models described in Section 
6, estimation using replicate data matrices and estimation subject to missing data. This latter 
extension may be of particular use in the analysis of relational data among a large number of 
nodes, where it may be too costly to make observations on all possible pairs. In such cases, the 
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Figure 4: City specific effects: The first two left singular vectors of £^(M|Y) indicate strong 
geographic patterns in the data. 
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value of Uij may be missing for many pairs, but one can make predictions based on estimates 
Ui,D,Vj obtained from the observed data. Using this approach to predict missing Unks in social 
networks and protein-protein interaction networks is one of my current research areas. However, 
for large datasets with 1000 nodes (10^ observations) or more, the MCMC scheme in this article 
becomes prohibitively computationally expensive. I am currently studying methods of making 
approximate Bayesian inference for large relational datasets. These include Laplace approximations 
for various components of the MCMC scheme of Sections 3 and 4, and using variational methods 
for approximating joint posterior distributions (Jordan et al., 1999). 

Although based on the conceptually straightforward Gibbs sampler, complexity of the full con- 
ditional distributions used in the Markov chain of Section 4 suggest wc look for an alternative 
procedure. For example, we could model only dj to be zero with non-zero probability, and sample 
from its full conditional distribution instead of marginally over Vy] and V[ j]. Unfortunately, an 
algorithm based on this approach will not mix well across ranks of M because dj, U[ j] and V[ 
are dependent to an extreme: The probability of sampling dj 7^ is essentially zero unless U[ and 
V[ j] are near a pair of local modes, but the probability of U[j] and V[ j] being in such a state is 
essentially zero if dj = 0. An alternative approach to sampling from distributions on complicated 
sample spaces is to use Metropolis-Hastings type algorithms. Based on my initial work on this 
problem, obtaining proposal distributions for these algorithms that achieve even minimal accep- 
tance rates requires an extreme amount of tuning. In contrast, Gibbs sampling for this model is 
possible as shown in this article, requires no tuning or pre-specification of model dimensions to be 
considered, and, for the examples in this article, mixes well across matrices M of different ranks. 

Computer code and data for all numerical results in this paper are available at 
www . Stat . Washington . edu/hof f . 

A Proof of proposition 1 

We first construct a sample from the uniform distribution on VK,m and then show that it has 
the desired conditional distributions. Let zi, . . . ,zx be i.i.d. multivariate normal (0, I^xm)- Let 
xi = zi and for j = 1, . . . , iiT — 1 let 

• Xj- = (xi---xj-); 

. p, =i-x,(x;.x,)-ix;.; 

Note that Pj is the symmetric, idempotent projection matrix of M"^ onto the null space of Xj, and 
so the vectors xi, . . . ,Xj+i are orthogonal. For each j, let JJj = Xj(X^Xj)~^/^. For j = K, we 
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have 



X' X 



Xn 



(Xi X2 • • • Xk) 










|X2| 








\y^K\ 



and so 



/ xi X2 y^K 
'|Xi| |X2| |xx| 



is a matrix of if orthonormal vectors in W^. The proof will be complete if we can show the 
following: 

Lemma 1: The distribution of \}k is the uniform distribution on Vx,m- 

Lemma 2: U[ fc+i]|U/fc = NfcU,t+i where is an orthonormal basis for the null space of Ujt and 
Ufc+i is distributed uniformly on the m — k dimensional sphere. 

Proof of Lemma 1. By Theorem 2.2.1 (iii) of Chikuse (2003), an m x if matrix of the form 
Xx(X^Xx)~^'^^ is uniformly distributed on VK,m if X^ is an m x if random matrix with rank K 
a.s. and having a distribution that is invariant under left-orthogonal transformations. We will show 
left invariancc for each constructed above by induction. Let H : M™ M™ be an orthogonal 
transformation, and note that HXi = Hxi = xi = Xi. Now suppose HX^ = X^. The distribution 
of HXfe+i is determined by its characteristic function: 

fc+l k 

£;[exp{i^t^Hxj}] = £;[exp{i^t^Hxj}£;[exp{itfc+iHxfc+i}|Xfe]] 



3=1 



Note that tj^^^Hx^+i = (P'^H'tfc+i)'zfc+i, where z^+i is a vector of independent standard normals 
and independent of X^. Thus the characteristic function can be rewritten as 

£;[exp{i ^ t^Hx,} exp{-^t'fe+iHPfeP;H'tfe+i}] = S[exp{i ^ t^x,} exp{-it;+iPfctfe+i}] (9) 



where Xj = Hxj and 



Pjk = HPfeP'fcH' = HPfeH' 

= H(I-X,(X',Xfe)-iX',)H' 

= I-HXfe((HXfe)'(HXfe))-i(HXfe)' 

A similar calculation shows that the distribution of Xj is characterized by 



k+l k J 

E[ex.p{i t'jXj}] = E[ex^p{i ^ t^Xj} exp{--tk+i'Pktk+i}], 
j=i j=i 



(10) 
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By assumption, = HX^, and so {xi, . . . ,Xfe,Pfc} = {xi, . . . ,Xfc,Pjt} and the expectations © 



and (|in|) are equal. Since the characteristic functions specify the distributions, HX^+i = X^+i 
and the lemma is proved. 

Proof of Lemma 2: The vector U[fc_|_x] is constructed as Uj^^^ij = PfcZfc+i/|PfcZfc+i|. Pfc has 
m — k eigenvalues of one, the rest being zero, giving the eigenvalue decomposition P^. = N^N^ 
where is a m x (m — k) matrix whose columns form an orthonormal basis for the null space of 
Ufc. Substituting in N^N';, for P^ gives 



Ur 



Nlz 



(z'NfcN;NfcN',z)V2 
(z'NfeNlz)i/2 



Note that for each k, = Xfc(X'^Xfc)~"'^/^, and so the projection matrix P^ can be written as 
I — UfcU'^j, a function of U^. Therefore, given Ufc, U[^^,+i] is equal in distribution to (a function 
of Ufc) multiplied by N'i^z/\'N'i.z\. The distribution of N'^z can be found via its characteristic 
function: For an m — fc-vector t 

^[exp{it'(N;z)}] = ^[exp{i(Nfet)'z}] 
= exp{-^t'N;Nfct} 

= exp{--tt}, 

and so we see that N^Zfe+i is equal in distribution to an m — A;-vector of independent standard 
normal random variables, and so NfcZfc+i/|NfcZfc_|.i| is uniformly distributed on on the m — /c-sphere. 



B Expectation of the bilinear form 

In this section we compute E[e^''^'^] for uniformly distributed unit vectors u and v and an arbitrary 
mx n matrix A. Integrating with respect to v can be accomplished by noting that as a function of 
V, e'^'"^^ is proportional to the von Mises-Fisher distribution on the n-sphere Sn, with parameter 
u'A: 
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J e"'A^p(v) dSn{v) = J e"'^^c„(0) dSn{v) 



Cn(0) 



Cn{\\n'A\ 
c„(0) 



e"'A^c„(||u'A||) dSniv) 



Cn(l|u'A||) 

= r(n/2)(2/||u'A||r/2-i4/2_i(||u'A||) 

where Ii, is the modified Bessel function of the first kind. The series expansion of /„/2-i(||u'A||) 
gives 

r(„/2,(2/||u'A||r/-'/„,,_.(||u'A||) = I l|u-A||^ V(,+ l)r(f+„/2)4- - 

All the terms in the sum are positive, so ii^[e'^'''^^] can be found by replacing ||u'A|p' with its 
expectation in the above equation. To compute this expectation, let A = LA^^^R' be the singular 
value decomposition of A, where L'L = R'R = I and A is a diagonal matrix of the eigenvalues of 
A'A. Then 

||u'A|p = u'AA'u 

= u 'LAV2r'raV2l' u 
= u'LAL'u 
= u'Au 

n 

where u = L'u. We will now identify the distribution of the vector {ul, . . . , u^}. Let B = {L, L"^} 
be an orthonormal basis for R"*. Since the uniform distribution on the sphere is rotationally 
invariant, B'u is equal in distribution to u, and so L'u is equal in distribution to the first n 
coordinates of u. Recall that a uniformly distributed vector u can be generated by sampling 
zi, . . . ,Zm independently from a standard normal distribution and then dividing each term by 
|5]z2|V2. Therefore, 



{zl... 





















{zf, . . . , z^} 



23 



where 9 ~ beta(n/2, (m — n)/2), q ~ Dirichletn(l/2, . . . , 1/2) and and q are independent. There- 
fore, ||u'A|p = ^A'q, where A is the diagonal of A and are the eigenvalues of A' A. The required 
expectation is then 

E[\\u'A\f^] = E[e^]E[{X'ciy] 

The first expectation is given by E[9'-] = [r(n/2 + Z)r(m/2)]/[r(m/2 + Z)r(n/2)]. The second 
expectation is the Zth-moment of a Dirichlet average, which results in a type of multiple hypergeo- 
metric function denoted as Ri{X, ^1). This expectation and its generalizations have been studied 
by Carlson (1977, Chapter 5), Dickey (1983) and others. An algorithm for recursively computing 
Ri, . . . ,Ri exactly from a generating function is provided in the next section. 

To make the result of the calculation a little more intuitive let A = A/^Aj = A/||A|p, and 
make use of the fact that £^[(A'q)^] = ||A|p^£^[(A'q)^], Combining the results gives 

and so we see how the expectation is related to the norm of A via ||A|P and the variability in 
relative sizes of the squared singular values via £'[(A'q)']. 

To get bounds on a finite-sum approximation to ^^[e"'-'^^], note that X]^^^ < £'[(A'q)'] < Aj^ax 

so 

, r(m/2) ^ , r(m/2) ^ , T{m/2) 

™^'^r(m/2 + /)r(l + Z)4^ /-^ ^^^r(m/2 + Z)r(l + Z)4' "^^r(m/2 + Z)r(l + /)4' 

The outer sums can be computed as 

^ ^ r(m/2 + Z)r(l + 1)41 - ^TxJ V2-i(VA)r(m/2) - ^ A f^^^^J^^p^^^ ^ ^^^i^ 



l=r+l 



and so bounds on E[e^'-^'^] — Yll=o 11-^11^''^/ be obtained. 



C Computing the multiple hypergeometric function 

Let q ~ Dirichlet„(ai, . . . ,Q;„). Carlson (1977, Section 6.6) shows that 

i=l 1=0 ^ J \ ' J 

Let Q = p(Q^)r(^!)^]^) -^[(^^q)^] • We now show how to calculate Ck+i based on ci,...,Cfc. Let 
f{t) = Yl'^o cit'' be the right-hand side of the equation and g{t) = — J2i^=i '^i log(l ~ be the log 
of the left-hand side. Taking derivatives with respect to t and evaluating at zero we have 

n 

/(0(0)=r(Z + l)q, 5^0(0) =r(/)^a,Ai. 

i=l 
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Since f(t) = e^^*\ we have 

;(^+i)(0)=gQ/(0(o)5(^+i-0(o). 
Plugging the values of f^''\0) into the sum gives 

k 

cfc+1 = 

1=0 

Simplifying gives 

k 

^[(A'q)^+i] = ^ 

1=0 

C-code with an R-interface to calculate {£'[(A'q)' : I = 0, . . . ,k} is available at my website. 
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